Distributed multi-camera multi-target association for real-time tracking

Tracking and associating different views of the same target across moving cameras is challenging as its appearance, pose and scale may vary greatly. Moreover, with multiple targets a management module is needed for new targets entering and old targets exiting the field of view of each camera. To address these challenges, we propose DMMA, a Distributed Multi-camera Multi-target Association for real-time tracking that employs a target management module coupled with a local data-structure containing the information on the targets. The target management module shares appearance and label information for each known target for inter-camera association. DMMA is designed as a distributed target association that allows a camera to join at any time, does not require cross-camera calibration, and can deal with target appearance and disappearance. The various parts of DMMA are validated using benchmark datasets and evaluation criteria. Moreover, we introduce a new mobile-camera dataset comprising six different scenes with moving cameras and objects, where DMMA achieves 92% MCTA on average. Experimental results show that the proposed tracker achieves a good association accuracy and speed trade-off by working at 32 frames per second (fps) with high definition (HD) videos.

using local and network information to obtain robustness to both occlusions and target appearance/disappearance. Moreover, a new camera joining the network can be fully operational after downloading the data-structure from the other nodes. A consensus among the cameras is obtained by sharing the data-structure variations across the network with decisions taken locally during association.
In summary, our main contributions are:
• a target representation that consists of both appearance and deep features;
• a target-management module that deals with occlusions as well as targets entering/exiting the cameras' FOVs;
• a novel mobile-camera dataset comprising six different scenes with moving cameras and objects.

Related work
Target association in camera networks deals with detection 8, tracking 9, re-identification 10 and distributed protocols 11. We provide an overview of the main methods with a focus on those solutions designed for real-time implementations.
Camera networks Strategies for target association in camera networks can be categorized into centralized, distributed, and decentralized 11. Most camera networks utilize a centralized approach where a server receives data from each camera in the network 12. Although this strategy can directly exploit existing single-camera protocols (e.g. a single-camera tracker) by fusing the information centrally, the presence of a single fusion center leads to a lack of scalability and possibly to a communication bottleneck 13. Distributed approaches operate with no fusion centers, thus improving the scalability and potentially reducing the communication bottlenecks. However, they are normally more complex protocols as they must reach a consensus remotely. Distributed approaches for camera networks include a multi-target square-root cubature information consensus filter to increase tracking accuracy and stability 14 and an information weighted consensus filter for solving the data association problem 15. Decentralized protocols are instead a hybrid between centralized and distributed solutions, as cameras are grouped into clusters and communicate with their local fusion centers only 16. This may be more scalable than a fully centralized approach but less so than a distributed one. Schwager et al. 17 present a strategy for the deployment of robotic cameras in a decentralized way, which can accommodate groups of cameras to monitor an environment. The majority of the solutions for camera networks focus on improving communication and how information is managed across the camera network while assuming targets are perfectly detected, tracked and re-identified 18,12,19. However, this may not always be the case. Graph modeling is an effective way to tackle object re-identification when the topology of the camera network is known. Chen et al. 12 introduced a global graph model that takes different observations as input, such as detections, tracklets, trajectories or pairs. Cai et al. 18 utilized the topology information of a camera network to re-identify objects across camera views. Hofmann et al. 19 presented a global min-cost flow graph that joins the different-view detections.
Detection In order to properly associate multiple targets across a camera network, targets must be detected in each of the cameras where they are visible 20. Mobile cameras are challenging for background subtraction techniques since the background constantly changes, hence approaches based on learning the shape of the target are normally preferable 21. Single-Shot Detector (SSD) 22, You Only Look Once (YOLO) 23, MobileNet 24 and EfficientDet 25 are examples of target detectors with implementations that can run in real time and are based on detecting a shape learned during training.
Tracking Once the targets are detected, an identifier (ID) is assigned to each target and ideally kept over time and across all cameras. If a target is new to the network, a new ID is created. Tracking and re-identification deal with assigning an ID in a single camera and across cameras, respectively: the main challenge of a tracker is to maintain the same ID for the same target over time, whereas re-identification focuses on assigning the same ID to the same target seen by different cameras. A Multi-Object Tracking (MOT) framework for mobile cameras was proposed by Choi et al. 26, where both the camera's ego-motion and the objects' paths are estimated.
Detections can be linked with Markov Decision Processes (MDP) 27, with Kalman filtering in the image space along with a frame-by-frame data association based on the Hungarian algorithm and weights given by the amount of bounding-box overlap (SORT) 28, or by a Convolutional Neural Network (CNN) 29. Graph-learning based methods 30,31 are effective in associating trajectories for the targets, but tend to fail in occlusion scenarios. This problem can be dealt with by learning and updating the appearance of targets using track management 32 or a person re-identification dataset 33. In order to increase robustness, a self-supervised learning detector can be employed by combining re-identification features 34 or by using the prediction of the motion 35.

Re-identification
Re-identification techniques deal with illumination changes, and variations of viewpoint and pose, by extracting robust visual features describing the target, including color 36, texture 37 and shape 38 features, or by deep learning 39. The latter methods are normally more effective as they are capable of obtaining the most discriminative features for the targets, although they fail in scenarios different from the training set. A solution to this is reinforcement learning, which allows an algorithm trained on a dataset to be tested on another dataset 40. An unsupervised cross-dataset transfer learning approach was proposed in 41, where an asymmetric multi-task dictionary model was learned to extract discriminative features from unlabelled target data. Cheng et al. 42 introduced a transfer-metric learning approach with a shared latent subspace to describe the commonalities of persons in different datasets. Wang et al. 43 proposed a transferable joint attribute-identity deep learning, which simultaneously learns attribute labels and identity features across different datasets. Compared to the state-of-the-art methods, we deal with association by relying on a local database shared across the network in order to deal with continuous changes of the appearance of a target and with cameras entering/exiting the network. Moreover, our algorithmic choices are made to optimize speed and enable a real-time implementation.

Proposed approach
Overview. Let $\mathcal{C} = \{C_1, \ldots, C_c, \ldots, C_N\}$ be a network with $N$ cameras and $\mathcal{L} = \{l_1, \ldots, l_l, \ldots, l_L\}$ be the set of possible target labels. Each camera $C_c$ has a local data-structure that stores the features for each target for the past $J$ frames and is maintained up-to-date over time.
In order to operate in real time, a target-management module in each camera optimizes the assignment of the labels to the targets over time, and manages cameras leaving/joining the network.
For intra-camera tracking, each camera is equipped with target detection and tracking modules. As the latter has to be scale-invariant to cope with moving cameras and fast to maintain real-time, a trade-off has to be sought between fast trackers that may not be scale invariant 44 and scale-invariant trackers that may be slow 45 . The target-management module performs association between existing targets and detections in each camera, and inter-camera association with the features of the targets received from other cameras.
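As an illustration of the local data-structure described in the overview, the following is a minimal sketch, assuming each target's features are kept in a bounded queue of length J keyed by label. The class name `LocalStore` and its methods are hypothetical and are not taken from the DMMA implementation; the snapshot helpers only mimic the idea that a newly joined camera can download the data-structure from other nodes.

```python
class LocalStore:
    """Per-camera store keeping the last J feature vectors of every label.

    Hypothetical sketch: names and layout are assumptions, not the paper's code.
    """

    def __init__(self, J=2):
        self.J = J
        self._history = {}  # label -> list of the most recent feature vectors

    def update(self, label, feature):
        """Append a feature for `label`, dropping entries older than J frames."""
        hist = self._history.setdefault(label, [])
        hist.append(list(feature))
        del hist[:-self.J]  # keep only the J most recent entries

    def latest(self, label):
        """Most recent feature stored for `label`."""
        return self._history[label][-1]

    def labels(self):
        return sorted(self._history)

    def snapshot(self):
        """Plain-dict copy that a newly joined camera could download."""
        return {lab: [list(f) for f in hist]
                for lab, hist in self._history.items()}

    @classmethod
    def from_snapshot(cls, snap, J=2):
        """Rebuild a store from a snapshot received from another camera."""
        store = cls(J)
        for lab, feats in snap.items():
            for f in feats:
                store.update(lab, f)
        return store
```

A camera joining the network would call `LocalStore.from_snapshot(...)` on the data received from a peer and then continue with its own `update` calls.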

Remark 1
Our focus is to implement an efficient target association while assuming ideal communication across cameras, namely that data transmission has no loss or delay. In our experiments, cameras exchange target information, wrapped in .xml files, through the computer memory. See 46 for more details on non-ideal communication.
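The remark mentions that target information is wrapped in .xml files, but the schema is not given. The sketch below shows one way such a message could be serialized and parsed with Python's standard library; the element and attribute names (`target`, `camera`, `label`, `feature`) are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET


def target_to_xml(camera_id, label, feature):
    """Serialize one target update to an XML string (hypothetical schema)."""
    root = ET.Element("target", camera=str(camera_id), label=str(label))
    feat = ET.SubElement(root, "feature")
    # Store the feature vector as space-separated floats.
    feat.text = " ".join(repr(float(v)) for v in feature)
    return ET.tostring(root, encoding="unicode")


def target_from_xml(text):
    """Parse the XML string back into (camera_id, label, feature)."""
    root = ET.fromstring(text)
    raw = root.find("feature").text or ""  # empty vectors parse as None
    feature = [float(v) for v in raw.split()]
    return int(root.get("camera")), int(root.get("label")), feature
```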
Target descriptor. Let $x_c^l(t)$ represent the features of target $l_l$ at time $t$ in camera $C_c$ obtained by target detection, and let a local data-structure in each $C_c$ maintain over time the features of each target for the past $J$ frames. The features for target $l_l$ are defined as

$$x_c^l(t) = \left[ H_{x_c^l(t)},\; D_{x_c^l(t)} \right],$$

where $H_{x_c^l(t)}$ and $D_{x_c^l(t)}$ are the appearance and deep features of the target, respectively. $H_{x_c^l(t)}$ concatenates two RGB $m$-bin histograms, $H^1_{x_c^l(t)}$ and $H^2_{x_c^l(t)}$, which are obtained on image patches of the upper and lower parts of a target. The bins of the histogram are defined through a computationally efficient colour-naming (CN) approach following the insights of 47, which shows that CN is a strong visual attribute robust to intensity variations 48,49 when the discriminative RGB values are learned directly from public datasets.
Similarly to 47, we choose $m = 11$ for its discriminating accuracy, with bins representing black, blue, brown, grey, green, orange, pink, purple, red, white and yellow colours. Unlike 50, which employs same-size patches, we calculate the histograms on image patches whose size adapts to the target bounding box in order to deal with changes in target size. Let $M$ and $N$ be the bounding-box height and width, respectively; the side of an image patch is set as a function of $M$ and $N$. $H^1_{x_c^l(t)}$ and $H^2_{x_c^l(t)}$ are each obtained on $K/2$ square image patches, whose centre $r$ is located as in 50:

$$r \sim \mathcal{N}(\mu, \Sigma),$$

where $\mathcal{N}$ is a normal probability density function with mean $\mu = [M/2, N/2]$ and covariance matrix $\Sigma$.
The colour-histogram feature is insensitive to pose and shape deformation because it uses statistical information of the target. However, as the detected target images usually include background and occlusions, this statistical feature alone is not robust for real-world applications. Deep-learning-based methods have been successfully applied to extract discriminative features for re-identification 51. Although these methods achieve better accuracy, they are usually time-consuming. To achieve real-time processing, we use an efficient pre-trained backbone network to extract the deep feature. The choice of backbone is explained in detail in the "Experimental results" section.
As shown in Fig. 3, the appearance feature $H_{x_c^l(t)}$ concatenates the upper and lower CN histograms and the deep feature $D_{x_c^l(t)}$ is extracted from a backbone network.
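The appearance part of the descriptor can be sketched as follows. This is a rough illustration only, under several assumptions that are not from the paper: the 11 colour names are assigned by nearest hand-picked RGB prototype (the paper uses a mapping learned from public datasets), the patch side and the spread of the normal distribution are arbitrary values, and patches are assigned to the upper or lower histogram according to the row of their centre. The deep feature is not covered.

```python
import random

# Hypothetical RGB prototypes for the 11 colour names (assumption; the paper
# relies on a mapping learned from public datasets).
PROTOTYPES = {
    "black": (0, 0, 0), "blue": (0, 0, 255), "brown": (139, 69, 19),
    "grey": (128, 128, 128), "green": (0, 128, 0), "orange": (255, 165, 0),
    "pink": (255, 192, 203), "purple": (128, 0, 128), "red": (255, 0, 0),
    "white": (255, 255, 255), "yellow": (255, 255, 0),
}
NAMES = sorted(PROTOTYPES)  # fixed bin order, m = 11


def colour_bin(pixel):
    """Index of the nearest prototype colour (squared distance in RGB)."""
    def dist(name):
        return sum((a - b) ** 2 for a, b in zip(pixel, PROTOTYPES[name]))
    return NAMES.index(min(NAMES, key=dist))


def cn_histogram(pixels):
    """L1-normalised 11-bin colour-name histogram of a list of RGB pixels."""
    hist = [0.0] * len(NAMES)
    for p in pixels:
        hist[colour_bin(p)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def sample_centres(M, N, k, rng, spread=0.25):
    """Draw k patch centres around [M/2, N/2], clipped to the box.

    `spread` (standard deviation as a fraction of the box size) is assumed.
    """
    centres = []
    for _ in range(k):
        row = min(max(rng.gauss(M / 2, spread * M), 0), M - 1)
        col = min(max(rng.gauss(N / 2, spread * N), 0), N - 1)
        centres.append((int(row), int(col)))
    return centres


def appearance_feature(crop, K=48, side=3, seed=0):
    """Concatenate upper-part and lower-part CN histograms of a target crop.

    `crop` is a list of rows of RGB tuples; `side` is an assumed patch side.
    """
    M, N = len(crop), len(crop[0])
    rng = random.Random(seed)
    half = side // 2
    upper, lower = [], []
    for r, c in sample_centres(M, N, K, rng):
        patch = [crop[i][j]
                 for i in range(max(r - half, 0), min(r + half + 1, M))
                 for j in range(max(c - half, 0), min(c + half + 1, N))]
        (upper if r < M / 2 else lower).extend(patch)
    return cn_histogram(upper) + cn_histogram(lower)
```

For a crop whose upper half is red and lower half is blue, the first 11 entries peak at the "red" bin and the last 11 at the "blue" bin.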
Target management. The target-management module performs association between existing targets and new target detections (intra-camera association), and between existing targets and new targets from the network (inter-camera association). The pairs of targets, $i$ and $j$, considered for association are those with a high appearance-correlation,

$$\kappa(x_i, x_j) > \psi, \tag{5}$$

where $\kappa$ is the correlation function, and, only for intra-camera association, a spatial intersection-over-union of bounding boxes greater than $\gamma$. The more abrupt the illumination changes expected in the scene, the lower $\psi$ should be; the faster the targets are expected to move and the lower the fps of the video stream, the lower $\gamma$ should be. Association is performed by the Hungarian Algorithm 7 and, in intra-camera association, detections that are not associated are considered new targets. A consensus among cameras is obtained by performing the intra-camera association, followed by the inter-camera association. This keeps the labels consistent over time for targets meeting the appearance-correlation constraint (Eq. 5). The target-management module sequentially processes the inputs received from the network and shares modifications of appearance (and label) with the network.
Object features are updated in the data-structure after each association. For intra-camera association, the stored feature is combined with $x_c(t)$, the appearance feature of the associated detection, for $t \in \{t - J, \ldots, t - 1, t\}$, weighted by the forgetting factor $\alpha_f$ of each camera. A lower $\alpha_f$ would result in a less discriminative feature vector, while a higher $\alpha_f$ would make the tracking less responsive to appearance changes, thus producing drift. For inter-camera association, appearance features are updated with the data received from other cameras: the stored feature is combined with $x_c^l(t)$, the appearance feature of the associated target with label $l_l$ from camera $C_c$, for $t \in \{t - J, \ldots, t - 1, t\}$, weighted by the network factor $\alpha_n$. The lower $\alpha_n$, the more the information from the network is considered.
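The gating and assignment steps can be sketched as below. Several choices here are assumptions rather than the paper's implementation: Pearson correlation stands in for the correlation function κ, an exhaustive search over one-to-one assignments stands in for the Hungarian algorithm (it yields the same optimum on the small problems considered here, but does not scale), and `blend` is an assumed exponential form of the feature update in which the factor weights the stored feature, chosen to match the stated effects of α_f and α_n.

```python
from itertools import permutations


def correlation(a, b):
    """Pearson correlation of two equal-length vectors (assumed kappa)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va > 0 and vb > 0 else 0.0


def iou(b1, b2):
    """Intersection-over-union of boxes given as (x1, y1, x2, y2)."""
    w = min(b1[2], b2[2]) - max(b1[0], b2[0])
    h = min(b1[3], b2[3]) - max(b1[1], b2[1])
    inter = max(w, 0) * max(h, 0)
    area1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    area2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    union = area1 + area2 - inter
    return inter / union if union > 0 else 0.0


def gated_score(track, det, psi, gamma, use_iou):
    """Correlation if the pair passes the gates, otherwise None."""
    k = correlation(track[0], det[0])
    if k <= psi or (use_iou and iou(track[1], det[1]) <= gamma):
        return None
    return k


def associate(tracks, detections, psi=0.4, gamma=0.2, use_iou=True):
    """Return ({track index: detection index}, [unmatched detection indices]).

    tracks and detections are lists of (feature, box). Set use_iou=False for
    inter-camera association, where the IoU gate does not apply.
    """
    score = [[gated_score(t, d, psi, gamma, use_iou) for d in detections]
             for t in tracks]
    # None slots let any track remain unmatched.
    slots = list(range(len(detections))) + [None] * len(tracks)
    best, best_total = {}, -1.0
    for perm in permutations(slots, len(tracks)):
        match, total = {}, 0.0
        for t, d in enumerate(perm):
            if d is not None and score[t][d] is not None:
                match[t] = d
                total += score[t][d]
        if total > best_total:
            best, best_total = match, total
    unmatched = [d for d in range(len(detections)) if d not in best.values()]
    return best, unmatched


def blend(stored, incoming, alpha):
    """Assumed update: alpha weights the stored feature, 1 - alpha the new one."""
    return [alpha * s + (1 - alpha) * x for s, x in zip(stored, incoming)]
```

Unmatched detections returned by `associate` would be treated as new targets in the intra-camera case.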

Validation
Datasets and experimental setup. To validate the proposed method, we run our experiments on people as targets. Existing camera network datasets, such as PETS2009 52, NLPR_MCT 12 and DukeMTMC 53, only contain static cameras for which the camera topology is also available; however, in order to properly test the proposed method, we require a dataset with targets moving continuously across cameras. To this aim, we introduce a new dataset that contains six scenes with up to four people recorded with two moving hand-held cameras, where people are annotated with a bounding box (using vbb 54). The diagrammatic overview of the six scenes is shown in Fig. 4. Videos are in HD (1280 × 720 pixels), running at 30 Hz and having more than 10,000 frames in total. In Scenes 1 and 2, people are static but continuously enter/exit the cameras' FOVs due to the camera motion; in Scenes 3 and 4, people move and the illumination conditions change drastically; and in Scenes 5 and 6, people move and occlude each other besides entering/exiting the cameras' FOVs. The dataset is fully labeled. Each person in the sequences is manually annotated using the video bounding box (vbb) 54. The annotations consist of the position and size of the objects, labeled with a unique ID.
For intra-camera tracking, we detect people with EfficientDet 25, which is faster than YOLO 23 and SSD 22, and track them with Fast Compressive Tracking (FCT) 55, chosen because of its speed (150 fps) and scale-invariant properties. FCT differentiates between target and background by calculating the likelihood of a nearby patch belonging to a target with an online Naive Bayes classifier. A convolution with Haar filters 56 generates a high-dimensional multi-scale feature vector, which is reduced by compressive sensing 55. We initialize one FCT per EfficientDet detection and improve its performance by combining it with new detections obtained every δ frames or when the FCT tracking confidence, φ, is lower than a threshold β. DMMA can run live, but the validation in this section is performed on video datasets to allow a proper analysis. DMMA is instantiated with δ = 5 frames, J = 2 frames, α f = 0.5, α n = 0.2, γ = 0.2, ψ = 0.4 and K = 48, and FCT with β = 0.4.
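The parameter values above and the rule for re-running the detector can be collected in a small sketch. The names `DMMAParams` and `needs_detection` are hypothetical, and counting frames from 0 so that the detector runs on frame 0 is an assumption; only the numeric values and the "every δ frames or confidence below β" rule come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DMMAParams:
    """Parameter values reported in the experimental setup."""
    delta: int = 5        # detector interval, in frames
    J: int = 2            # frames of history kept per target
    alpha_f: float = 0.5  # forgetting factor (intra-camera update)
    alpha_n: float = 0.2  # network factor (inter-camera update)
    gamma: float = 0.2    # IoU threshold
    psi: float = 0.4      # appearance-correlation threshold
    K: int = 48           # number of image patches per target
    beta: float = 0.4     # FCT confidence threshold


def needs_detection(frame_idx, confidences, params=DMMAParams()):
    """True when the detector should run on this frame.

    The detector is triggered every `delta` frames, or as soon as any tracker
    confidence (phi) falls below `beta`.
    """
    periodic = frame_idx % params.delta == 0
    low_confidence = any(phi < params.beta for phi in confidences)
    return periodic or low_confidence
```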
We implement all experiments using the same system, whose configuration is shown in Table 1.

Performance measures.
To evaluate the performance of target descriptors, we use Cumulative Matching Characteristic (CMC) curves 57 as the evaluation criterion, defined as a function of Rank-$r$:

$$\mathrm{CMC}(r) = \frac{|C(r)|}{|P_g|},$$

where $|P_g|$ represents the total number of images in the gallery, and the query set $C(r)$ is defined as:

$$C(r) = \{\, p_i : \mathrm{rank}(p_i) \le r \,\} \quad \forall p_i \in P_g.$$

Since most intra-camera tracking algorithms use multi-object tracking metrics as their evaluation criteria, we utilize the evaluation metrics defined in 58. These include, among others, the number of False Positives (FP). Inter-camera association is evaluated with MCTA:

$$\mathrm{MCTA} = \frac{2pr}{p + r} \left( 1 - \frac{\sum_t m_t^s}{\sum_t u_t^s} \right) \left( 1 - \frac{\sum_t m_t^c}{\sum_t u_t^c} \right),$$

where $p = 1 - \frac{\sum_t f_t}{\sum_t h_t}$ is the precision, $r = 1 - \frac{\sum_t i_t}{\sum_t g_t}$ is the recall, and $m_t$, $u_t$, $f_t$, $h_t$, $i_t$ and $g_t$ are the number of ID switches, true positives, false positives, trajectory hypotheses, misses and ground truths at time $t$, respectively, and where $s$ and $c$ denote matches within the same and across cameras, respectively. MCTA ranges between 0 and 1 (the higher the MCTA, the better the performance). Speed is measured in frames per second (fps) for each algorithm.
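The CMC and MCTA measures can be computed as in the following sketch. It assumes CMC(r) is the fraction of gallery items whose correct match is ranked at or above r, and that MCTA is the product of the detection F-score and the two mismatch terms built from per-frame counts; the function and argument names are illustrative.

```python
def cmc(ranks, gallery_size, r):
    """CMC at rank r: share of gallery items with rank(p_i) <= r."""
    return sum(1 for k in ranks if k <= r) / gallery_size


def mcta(f, h, i, g, m_s, u_s, m_c, u_c):
    """MCTA from per-frame count lists.

    f: false positives, h: hypotheses, i: misses, g: ground truths,
    m_s / u_s: ID switches and true positives within the same camera,
    m_c / u_c: ID switches and true positives across cameras.
    """
    p = 1 - sum(f) / sum(h)           # precision
    r = 1 - sum(i) / sum(g)           # recall
    detection = 2 * p * r / (p + r)   # F-score of the detections
    same_cam = 1 - sum(m_s) / sum(u_s)
    cross_cam = 1 - sum(m_c) / sum(u_c)
    return detection * same_cam * cross_cam
```

For example, with precision and recall both at 0.9, 5% of within-camera matches switched and 20% of cross-camera matches switched, the score is 0.9 × 0.95 × 0.8 = 0.684.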
Experimental results. In this section, we first evaluate the target representation, the intra-camera, and the inter-camera tracking performance. Then we analyze the impact of parameters and compare with state-of-the-art methods on the MOT16 dataset. Finally, qualitative results are presented. Table 2 compares the target representations in terms of CMC 57 and speed, on 600 pairs of images distributed among different targets and case difficulties (e.g. due to occlusions or lighting changes) of the proposed dataset. As can be observed, NASNet has the best performance, with 94.2% of queries resulting in a rank-1 correct match. CN + MobileNet is second, with approximately 92.1% of the queries resulting in a rank-1 correct match and 98.3% in the 30 top ranked. However, NASNet (12.5 fps) runs at less than half the speed of ours (28.1 fps). Thus, the proposed CN + MobileNet shows the best trade-off between performance and speed.

Intra-camera tracking performance
We compare the proposed method against DeepSORT 29, MDP 27, MFI_tst 35 and FairMOT 34 for intra-camera tracking. As DMMA would use information across cameras, we perform a comparison with DMMA run as an intra-camera tracker, i.e. with no inter-camera communication (DMMA-nc). We also compare DMMA against a detector and the Hungarian Algorithm applied at every frame with no FCT tracking (DMMA-nt). DMMA-nc and DMMA-nt are baselines optimized for the task at hand. Table 3 compares intra-camera tracking results. DMMA-nc is the only method running in real-time (32 fps), while maintaining the best average MOTA. In the most difficult scenes in terms of colour changes and heavy occlusions (Scenes 3, 5 and 6), DeepSORT loses accuracy with respect to MDP and DMMA-nc, while FairMOT shows comparable results with respect to DMMA-nc but cannot reach real-time performance. Where FairMOT and MDP have a higher MOTA, DMMA-nc has a comparable accuracy. Figure 5 shows sample tracking results on the proposed dataset.

Inter-camera tracking performance
Table 4 reports the inter-camera association results. DMMA has a higher MCTA than DMMA-nt and DMMA-nc. DMMA-nc performs better than DMMA-nt, but worse than DMMA, thus validating the use of information from the network. The result of DMMA (MCTA 63.9) on Scene 3, which has heavy illumination changes, can be considered satisfactory, given that no explicit cross-camera calibration or training is performed.
In terms of speed, DMMA achieves 32 fps, only 1 fps slower than DMMA-nc, which does not receive data from the network. Note that DMMA-nc and DMMA have a higher standard deviation due to the variability of the target search performed by FCT. As we performed all the tests with the display on for the analysis of the results, we also tested the proposed solution with no display to simulate how the implementation would perform if deployed with no screens (when they are not required or available in a system). In this case, the speed increases by about 24% on average.

Impact of parameters
Table 5 shows the impact of the detection interval δ and the number of maintained frames J on our dataset. We observe that too large δ and J degrade accuracy, which indicates drift when the tracker runs for a long duration without the detector's correction. However, a smaller δ results in calling the detector and initializing trackers frequently, which is time-consuming. Consequently, we set δ = 5 and J = 2 to strike a good balance between speed and accuracy. We further perform a sensitivity analysis for ψ, γ, α f and α n and, on average, results remain substantially unchanged in our experiments with a 10% variation.

Performance on MOT16
We compare DMMA-nc with state-of-the-art MOT trackers including one-shot (FairMOT) and two-step (DeepSORT 29 and MFI_tst 35) MOT trackers. Following FairMOT 34, we pre-train the detector on the CrowdHuman dataset 60. Table 6 shows the performance results. Due to the robustness of the proposed target representation, we have the lowest IDs among the compared trackers. This demonstrates that we obtain consistent trajectories of objects. Also, DMMA-nc has the second highest MOTA score and IDF1. This can be attributed to the proposed target management maintaining object association in spite of occlusions and targets entering/exiting the camera FOVs. Although FairMOT outperforms DMMA-nc in MOT metrics, the main contribution of DMMA is to devise a data association among a mobile camera network without cross-camera calibration.
Qualitative results Finally, qualitative results are shown in Fig. 6. In Fig. 6e, f, we can appreciate the heavy illumination change in Scene 3 that leads to a wrong label assignment in Camera 1 while tracking performs well in Camera 2. In Fig. 6h, although Target 2 is completely occluded by Target 4, the method can properly assign the correct label. Similarly, in Fig. 6k the correct labels are assigned even when the targets are not entirely visible. However, labels 5 and 6 are wrongly assigned due to the very dark conditions created in the scene.

Conclusion
We presented a target-management module for multi-camera multi-target tracking for a moving-camera network that runs in real-time, reaching 32 fps on HD videos. The tracker, DMMA, allows cameras to join or leave without affecting the network's performance, and re-identifies targets when they re-enter the cameras' FOVs. The tracker can also deal with heavy occlusions and targets at different scales. Experiments were performed on a new mobile-camera dataset and a public MOT dataset. Experimental results demonstrate that the proposed approach performs well in terms of accuracy, effectiveness and speed. As future work, we will extend the validation to other camera networks with a variable number of cameras and with a real communication channel.

Data availability
The datasets used and analysed during the current study are available from the corresponding author on reasonable request.